- Motivation
- Biases
- Depth bias
- Composition bias
- Mean-variance correlation
- Normalisation strategies
- Feature Selection
February 2026
We derive biological insights downstream by comparing cells against each other.
But differences in UMI counts between cells make those comparisons harder.
Why does the total number of transcript molecules (UMI counts) detected differ between cells?
Normalization reduces technical differences between cells so that the remaining differences reflect biology, allowing meaningful comparison of expression profiles.
Depth bias: differences in sequencing depth (total reads or UMIs captured) between cells
Simple library size normalization accounts for the depth bias
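A minimal sketch of library size normalization in base R, assuming a toy genes-by-cells matrix with made-up values. The names `counts`, `lib_size`, `size_factors` and `norm_counts` are illustrative and not taken from any package.

```r
# Toy genes x cells count matrix (values are made up for illustration)
counts <- matrix(
  c(10, 20,  40,
     0,  5,  10,
    90, 75, 150),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cell1", "cell2", "cell3"))
)

# Library size: total counts detected in each cell
lib_size <- colSums(counts)

# Size factors, scaled so that they average to 1 across cells
size_factors <- lib_size / mean(lib_size)

# Divide every cell (column) by its own size factor
norm_counts <- sweep(counts, 2, size_factors, "/")

# After normalization every cell has the same total
colSums(norm_counts)
```

Scaling the size factors to a mean of 1 keeps the normalized values on roughly the same scale as the raw counts, which is a common convention rather than a requirement.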
Mean and variance of raw counts for genes are correlated
More highly expressed genes tend to look more variable because larger numbers result in higher variance
A gene expressed at a low level tends to have a low variance across cells:
var(c(2,4,2,4,2,4,2,4)) = 1.14
A gene with the same proportional differences between cells, but expressed at a higher level will have higher variance:
var(c(20,40,20,40,20,40,20,40)) = 114.29
If we take the logs of the expression values, the variances are the same for both genes:
var(log(c(2,4,2,4,2,4,2,4))) = 0.14
var(log(c(20,40,20,40,20,40,20,40))) = 0.14
This “variance-stabilising transformation” helps to remove the correlation between mean and variance
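The two-gene example can be run as a short base R sketch. It also shows why the log works here: multiplying every value by a constant k only adds log(k) to each logged value, and adding a constant does not change the variance. The vector names `low` and `high` are illustrative.

```r
# Same proportional differences between cells, at two expression levels
low  <- c(2, 4, 2, 4, 2, 4, 2, 4)
high <- 10 * low

# On the raw scale the variance grows with the square of the scaling factor
raw_ratio <- var(high) / var(low)

# On the log scale the scaling factor becomes an additive shift,
# so both genes have the same variance
log_var_low  <- var(log(low))
log_var_high <- var(log(high))

raw_ratio
c(log_var_low, log_var_high)
```

Real count data contain zeros, for which `log()` returns `-Inf`; that is why a pseudocount is usually added first, for example with `log1p()`.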
Normalization has two steps
CPM: convert raw counts to counts per million by dividing each cell's counts by that cell's total count and multiplying by 1,000,000
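A minimal CPM sketch in base R, assuming a toy genes-by-cells matrix. The function name `to_cpm` and its `scale` argument are hypothetical helpers written for this example, not part of any package.

```r
# Convert a genes x cells count matrix to counts per `scale`
# (hypothetical helper; scale = 1e6 gives counts per million)
to_cpm <- function(counts, scale = 1e6) {
  sweep(counts, 2, colSums(counts), "/") * scale
}

# Toy data (made-up values)
counts <- matrix(
  c(10, 20,
     0,  5,
    90, 75),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

cpm <- to_cpm(counts)

# Log-transform with a pseudocount of 1 so that zeros stay finite
log_cpm <- log1p(cpm)

colSums(cpm)  # each cell sums to one million
```

Single-cell workflows often use a smaller scale factor such as 10,000, because a typical cell yields far fewer than a million UMIs; the `scale` argument makes that choice explicit.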
DESeq’s size factor
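The DESeq size factor is commonly described as a median of ratios: each sample's counts are compared with a per-gene geometric mean across samples, and the median ratio becomes that sample's size factor. Below is a base R sketch of that idea, not the package's own code; the function name `median_ratio_size_factors` is hypothetical.

```r
# Median-of-ratios size factors for a genes x samples count matrix
# (illustrative reimplementation of the idea, not the DESeq code)
median_ratio_size_factors <- function(counts) {
  log_counts <- log(counts)

  # Per-gene log geometric mean; -Inf whenever any sample has a zero
  log_geo_means <- rowMeans(log_counts)

  # Only genes detected in every sample can be used as a reference
  usable <- is.finite(log_geo_means)
  stopifnot(any(usable))

  # Median log ratio to the reference, per sample, back on the raw scale
  apply(log_counts[usable, , drop = FALSE], 2, function(sample_log) {
    exp(median(sample_log - log_geo_means[usable]))
  })
}

# Toy data: the second sample is sequenced exactly twice as deeply
counts <- cbind(s1 = c(10, 20, 30, 40),
                s2 = c(20, 40, 60, 80))

sf <- median_ratio_size_factors(counts)
sf
```

Because the reference uses only genes with non-zero counts in every sample, this approach tends to struggle on sparse single-cell matrices, where few or no genes are detected in all cells.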
“This procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. We named this method sctransform.”
Steps:
1. Estimate size factors using a regularized negative binomial regression model
2. Scale the counts by dividing the raw counts by the estimated size factors
3. Apply a variance stabilizing transformation to the scaled counts by calculating Pearson residuals from the negative binomial regression model
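To make the Pearson residual idea concrete, here is a deliberately simplified base R sketch. It is not sctransform: instead of a regularized regression it assumes the expected count of a gene in a cell is proportional to the gene total times the cell total, and it assumes a single fixed overdispersion `theta` for all genes. The function name `nb_pearson_residuals` and the default `theta = 100` are illustrative choices.

```r
# Simplified negative binomial Pearson residuals
# (illustrative only; no regression and no regularization)
nb_pearson_residuals <- function(counts, theta = 100) {
  # Expected count under "gene total x cell total / grand total"
  mu <- outer(rowSums(counts), colSums(counts)) / sum(counts)

  # Negative binomial variance: mu + mu^2 / theta
  # Genes with zero total counts give 0/0 and should be filtered first
  (counts - mu) / sqrt(mu + mu^2 / theta)
}

# Toy data in which the second cell is just a deeper copy of the first,
# so depth alone explains every count
counts <- outer(c(geneA = 1, geneB = 2, geneC = 3), c(cell1 = 10, cell2 = 20))

res <- nb_pearson_residuals(counts)
res  # all zeros: nothing is left once depth is accounted for
```

A residual near zero means the observed count matches what sequencing depth alone would predict; large positive or negative residuals flag counts that depth does not explain.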
By default, the selected highly variable genes are used for downstream dimensionality reduction and clustering (although in most cases you can change this)
Select genes which capture biologically-meaningful variation, while reducing the number of genes which only contribute to technical noise
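A minimal feature selection sketch in base R, assuming a small matrix of already log-normalized values (made up for illustration). It simply ranks genes by variance and keeps the top `n_top`; established tools model the mean-variance trend rather than ranking on raw variance, so treat this as the idea only. The names `log_norm`, `gene_var`, `n_top` and `hvgs` are illustrative.

```r
# Toy log-normalized expression, genes x cells (made-up values)
log_norm <- matrix(
  c(1.0, 1.1, 0.9, 1.0,   # geneA: nearly constant
    0.0, 2.0, 0.0, 2.0,   # geneB: strongly variable
    0.5, 0.5, 0.6, 0.5,   # geneC: nearly constant
    3.0, 0.0, 3.0, 0.5),  # geneD: strongly variable
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("cell", 1:4))
)

# Variance of each gene across cells
gene_var <- apply(log_norm, 1, var)

# Keep the n_top most variable genes
n_top <- 2
hvgs <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]

# Subset the matrix to the selected genes for downstream steps
log_norm_hvg <- log_norm[hvgs, , drop = FALSE]
hvgs
```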